# Multimodal Retrieval

| Model | License | Description | Tags | Author | Downloads | Likes |
|---|---|---|---|---|---|---|
| FG-CLIP Base | Apache-2.0 | FG-CLIP is a fine-grained visual-textual alignment model that achieves both global and region-level image-text alignment through two-stage training. | Text-to-Image, Transformers, English | qihoo360 | 692 | 2 |
| CLIP ViT-L-14 Spectrum Icons 20k | MIT | A vision-language model fine-tuned from CLIP ViT-L/14, optimized for abstract image-text retrieval tasks. | Text-to-Image, TensorBoard, English | JianLiao | 1,576 | 1 |
| ProLIP ViT-B-16 DC-1B 12.8B | MIT | A Probabilistic Language-Image Pretraining (ProLIP) ViT-B/16 model pretrained on the DataComp-1B dataset. | Text-to-Image, Safetensors | SanghyukChun | 460 | 0 |
| Jina CLIP v2 | – | A versatile multilingual multimodal embedding model for text and images, supporting 89 languages, higher image resolution, and Matryoshka (nested) representations. | Text-to-Image, Transformers, Multilingual | jinaai | 47.56k | 219 |
| CLIP GmP ViT-L-14 | MIT | A fine-tuned version of OpenAI's CLIP ViT-L/14 that improves performance through Geometric Parametrization (GmP), with particular attention to the text encoder. | Text-to-Image, Transformers | zer0int | 6,275 | 433 |
| PMC ViT-L-14 HF | – | A vision-language model fine-tuned from CLIP ViT-L/14 on the PMC-OA dataset. | Text-to-Image, Transformers | ryanyip7777 | 260 | 1 |
| CLIP ViT-B-16 DataComp.XL S13B B90K | MIT | A CLIP ViT-B/16 model trained with OpenCLIP on the DataComp-1B dataset, used primarily for zero-shot image classification and image-text retrieval. | Text-to-Image | laion | 4,461 | 7 |
| Arabic CLIP ViT Base Patch32 | – | An adaptation of the Contrastive Language-Image Pre-training (CLIP) model to Arabic, learning concepts from images and associating them with Arabic text descriptions. | Text-to-Image, Arabic | LinaAlhuri | 33 | 2 |
| CLIP ViT-bigG-14 LAION-2B 39B B160K | MIT | A vision-language model trained with the OpenCLIP framework on the LAION-2B dataset, supporting zero-shot image classification and cross-modal retrieval. | Text-to-Image | laion | 565.80k | 261 |
| CLIP ConvNeXt-Base W LAION-2B S13B B82K AugReg | MIT | A CLIP model with a ConvNeXt-Base image tower, trained with OpenCLIP on a subset of LAION-5B, focused on zero-shot image classification. | Text-to-Image, TensorBoard | laion | 40.86k | 7 |
| Taiyi CLIP RoBERTa 102M ViT-L Chinese | Apache-2.0 | The first open-source Chinese CLIP model, pre-trained on 123 million text-image pairs, with a RoBERTa-base text encoder. | Text-to-Image, Transformers, Chinese | IDEA-CCNL | 668 | 19 |
| CLIP ViT-H-14 LAION-2B S32B B79K | MIT | A vision-language model trained with the OpenCLIP framework on the English LAION-2B dataset, supporting zero-shot image classification and cross-modal retrieval. | Text-to-Image | laion | 1.8M | 368 |
| CLIP ViT-L-14 LAION-2B S32B B82K | MIT | A vision-language model trained with the OpenCLIP framework on the English subset of LAION-2B, supporting zero-shot image classification and image-text retrieval. | Text-to-Image, TensorBoard | laion | 79.01k | 48 |